An introduction to pandas: pandas is a powerful tool for data analysis and manipulation
➢ Fast, flexible, expressive data structures: the DataFrame and the Series
➢ SQL-like insert, delete, query, and update operations
➢ A rich collection of data-processing functions
➢ Time series analysis
➢ Flexible handling of missing data

Reading Data from Different Sources¶

Reading Text Files¶

pandas.read_csv()¶

image.png

➢ The sep parameter specifies the delimiter; with the wrong delimiter, each row is read as one long field.
➢ The header parameter specifies which row supplies the column names; with None, default numeric names are assigned.
➢ encoding is the file's encoding; common encodings include utf-8, utf-16, gbk, gb2312, and gb18030. With the wrong encoding the data cannot be read, and the IPython interpreter reports a parse error.
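To illustrate these three parameters together, the sketch below reads a small in-memory GBK-encoded, semicolon-separated sample; the data values and column names here are invented for the example:

```python
import io
import pandas as pd

# A tiny semicolon-separated sample, GBK-encoded, with no header row
raw = "1;苗宇怡;165\n2;赵颖;321\n".encode("gbk")

df = pd.read_csv(
    io.BytesIO(raw),
    sep=";",         # the wrong sep would leave each row as a single column
    header=None,     # no header row, so pandas would number the columns 0, 1, 2
    names=["info_id", "name", "expenditure"],  # optional explicit names
    encoding="gbk",  # must match the file's real encoding
)
print(df)
```

A real file path can replace the `io.BytesIO` object; everything else stays the same.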

In [1]:
import pandas as pd
In [2]:
# 'meal_order_info.csv' is the file path
# encoding='GBK'   specifies the encoding
pd.read_csv('meal_order_info.csv',encoding='GBK').head()
Out[2]:
info_id emp_id number_consumers mode dining_table_id dining_table_name expenditure dishes_count accounts_payable use_start_time ... lock_time cashier_id pc_id order_number org_id print_doc_bill_num lock_table_info order_status phone name
0 417 1442 4 NaN 1501 1022 165 5 165 2016/8/1 11:05:36 ... 2016/8/1 11:11:46 NaN NaN NaN 330 NaN NaN 1 18688880641 苗宇怡
1 301 1095 3 NaN 1430 1031 321 6 321 2016/8/1 11:15:57 ... 2016/8/1 11:31:55 NaN NaN NaN 328 NaN NaN 1 18688880174 赵颖
2 413 1147 6 NaN 1488 1009 854 15 854 2016/8/1 12:42:52 ... 2016/8/1 12:54:37 NaN NaN NaN 330 NaN NaN 1 18688880276 徐毅凡
3 415 1166 4 NaN 1502 1023 466 10 466 2016/8/1 12:51:38 ... 2016/8/1 13:08:20 NaN NaN NaN 330 NaN NaN 1 18688880231 张大鹏
4 392 1094 10 NaN 1499 1020 704 24 704 2016/8/1 12:58:44 ... 2016/8/1 13:07:16 NaN NaN NaN 330 NaN NaN 1 18688880173 孙熙凯

5 rows × 21 columns

Storing Text Files¶

DataFrame.to_csv()¶

image.png

In [3]:
data=pd.read_csv('meal_order_info.csv',encoding="gbk")
# data
# index=None  omits the row index when writing
data.to_csv('meal_order_info_save_by_to_csv.csv',
            index=None)

Reading Excel Files¶

pandas.read_excel()¶

image.png

In [4]:
pd.read_excel('meal_order_detail.xlsx').head()
Out[4]:
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016-08-01 11:05:36 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
1 2958 417 609957 NaN NaN 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n 0 1 48 NaN 2016-08-01 11:07:07 NaN NaN NaN 0 NaN NaN caipu/202003.jpg 1442
2 2961 417 609950 NaN NaN 大蒜苋菜 0 1 30 NaN 2016-08-01 11:07:40 NaN NaN NaN 0 NaN NaN caipu/303001.jpg 1442
3 2966 417 610038 NaN NaN 芝麻烤紫菜 0 1 25 NaN 2016-08-01 11:11:11 NaN NaN NaN 0 NaN NaN caipu/105002.jpg 1442
4 2968 417 610003 NaN NaN 蒜香包 0 1 13 NaN 2016-08-01 11:11:30 NaN NaN NaN 0 NaN NaN caipu/503002.jpg 1442

DataFrame.to_excel()¶

image.png

In [5]:
data=pd.read_excel('meal_order_detail.xlsx').head(10)
# data
data.to_excel('meal_order_detail_save_by_to_excel.xlsx',
           index=None)

Common DataFrame Operations¶

The Two Core Data Structures¶

Series¶

A Series is a one-dimensional labelled sequence, comparable to a Python list.

In [6]:
pd.Series(['Jason','Jackson','Black','Jack'])
Out[6]:
0      Jason
1    Jackson
2      Black
3       Jack
dtype: object
In [7]:
pd.Series([['Jason','Jackson','Jack','Black'],[1,2,3,4]])
Out[7]:
0    [Jason, Jackson, Jack, Black]
1                     [1, 2, 3, 4]
dtype: object

DataFrame¶

A DataFrame is a two-dimensional, tabular data structure.

In [8]:
import numpy as np
In [9]:
data=np.random.randint(1,9,16).reshape(4,4)
# data
index=['AI','BI','CI','DI']
columns=['AC','BC','CC','DC']
# index: the row labels
# columns: the column labels
pd.DataFrame(data,index=index,columns=columns)
Out[9]:
AC BC CC DC
AI 7 6 5 1
BI 8 8 4 1
CI 4 5 8 7
DI 2 4 8 8
In [10]:
# pd.DataFrame?

Common DataFrame Attributes¶

image.png

In [11]:
data=np.random.randint(1,9,16).reshape(4,4)
# data
index=['AI','BI','CI','DI']
data=pd.DataFrame(data,index=index,columns=columns)
data
Out[11]:
AC BC CC DC
AI 2 1 1 5
BI 1 8 4 4
CI 6 6 6 7
DI 4 3 6 3
In [12]:
data.values
Out[12]:
array([[2, 1, 1, 5],
       [1, 8, 4, 4],
       [6, 6, 6, 7],
       [4, 3, 6, 3]])
In [13]:
data.index
Out[13]:
Index(['AI', 'BI', 'CI', 'DI'], dtype='object')
In [14]:
data.columns
Out[14]:
Index(['AC', 'BC', 'CC', 'DC'], dtype='object')
In [15]:
data.dtypes
Out[15]:
AC    int32
BC    int32
CC    int32
DC    int32
dtype: object
In [16]:
data.size
Out[16]:
16
In [17]:
data.ndim
Out[17]:
2
In [18]:
data.shape
Out[18]:
(4, 4)

DataFrame-Related Operations¶

CRUD (create, read, update, delete)

read¶

image.png

Common access operations:
head, tail, iloc (integer-position based), loc (label based)

In [19]:
data=np.random.randint(1,9,16).reshape(4,4)
# data
index=['AI','BI','CI','DI']
data=pd.DataFrame(data,index=index,columns=columns)
data
Out[19]:
AC BC CC DC
AI 3 7 8 7
BI 2 2 8 3
CI 2 8 2 5
DI 1 7 1 3
In [20]:
# head() returns the first 5 rows by default
data.head(2)
Out[20]:
AC BC CC DC
AI 3 7 8 7
BI 2 2 8 3
In [21]:
# tail() returns the last 5 rows by default
data.tail(2)
Out[21]:
AC BC CC DC
CI 2 8 2 5
DI 1 7 1 3
In [22]:
data
Out[22]:
AC BC CC DC
AI 3 7 8 7
BI 2 2 8 3
CI 2 8 2 5
DI 1 7 1 3
In [23]:
# access elements by integer position
data.iloc[1,1]
Out[23]:
2
In [24]:
data.iloc[2:,2]
Out[24]:
CI    2
DI    1
Name: CC, dtype: int32
In [25]:
# access elements by row/column label
data.loc['BI','BC']
Out[25]:
2
In [26]:
data.loc['BI':,'BC':]
Out[26]:
BC CC DC
BI 2 8 3
CI 8 2 5
DI 7 1 3

update¶

Reassign values on top of the access operations above.

In [27]:
index=['AI','BI','CI','DI']
columns=['AC','BC','CC','DC']
data=pd.DataFrame(
    np.random.randint(1,8,16).reshape(4,4),
    index=index,
    columns=columns
)
In [28]:
data.loc['AI','AC']=99
data
Out[28]:
AC BC CC DC
AI 99 7 1 7
BI 6 5 2 6
CI 5 5 5 3
DI 4 4 4 7
In [29]:
data.loc['BI','AC':]=[1,2,3,4]
data
Out[29]:
AC BC CC DC
AI 99 7 1 7
BI 1 2 3 4
CI 5 5 5 3
DI 4 4 4 7
In [30]:
data.loc['CI','AC':]=np.nan
data
Out[30]:
AC BC CC DC
AI 99.0 7.0 1.0 7.0
BI 1.0 2.0 3.0 4.0
CI NaN NaN NaN NaN
DI 4.0 4.0 4.0 7.0

Adding a Column¶

In [31]:
data
Out[31]:
AC BC CC DC
AI 99.0 7.0 1.0 7.0
BI 1.0 2.0 3.0 4.0
CI NaN NaN NaN NaN
DI 4.0 4.0 4.0 7.0
In [32]:
data['EC']=13
data
Out[32]:
AC BC CC DC EC
AI 99.0 7.0 1.0 7.0 13
BI 1.0 2.0 3.0 4.0 13
CI NaN NaN NaN NaN 13
DI 4.0 4.0 4.0 7.0 13

Deleting Data¶

DataFrame.drop()¶

image.png

In [33]:
data
Out[33]:
AC BC CC DC EC
AI 99.0 7.0 1.0 7.0 13
BI 1.0 2.0 3.0 4.0 13
CI NaN NaN NaN NaN 13
DI 4.0 4.0 4.0 7.0 13
In [34]:
# axis selects whether to drop a row (axis=0) or a column (axis=1)
data.drop('EC',axis=1)
Out[34]:
AC BC CC DC
AI 99.0 7.0 1.0 7.0
BI 1.0 2.0 3.0 4.0
CI NaN NaN NaN NaN
DI 4.0 4.0 4.0 7.0
In [35]:
data.drop('CI',axis=0)
Out[35]:
AC BC CC DC EC
AI 99.0 7.0 1.0 7.0 13
BI 1.0 2.0 3.0 4.0 13
DI 4.0 4.0 4.0 7.0 13

Statistical Functions¶

image.png

image.png

In [36]:
data=np.random.randint(1,9,16)
index=['AI','BI','CI','DI']
columns=['AC','BC','CC','DC']
data=pd.DataFrame(data.reshape(4,4),index=index,
                 columns=columns)
data
Out[36]:
AC BC CC DC
AI 3 1 3 8
BI 1 4 6 6
CI 2 5 5 5
DI 6 2 8 8
In [37]:
data.min()
Out[37]:
AC    1
BC    1
CC    3
DC    5
dtype: int32
In [38]:
data.mean()
Out[38]:
AC    3.00
BC    3.00
CC    5.50
DC    6.75
dtype: float64
In [39]:
data.max()
Out[39]:
AC    6
BC    5
CC    8
DC    8
dtype: int32
In [40]:
data.median()
Out[40]:
AC    2.5
BC    3.0
CC    5.5
DC    7.0
dtype: float64
In [41]:
data.std()
Out[41]:
AC    2.160247
BC    1.825742
CC    2.081666
DC    1.500000
dtype: float64
In [42]:
data.count()
Out[42]:
AC    4
BC    4
CC    4
DC    4
dtype: int64
In [43]:
data.mode()
Out[43]:
AC BC CC DC
0 1 1 3 8.0
1 2 2 5 NaN
2 3 4 6 NaN
3 6 5 8 NaN
In [44]:
data.describe()
Out[44]:
AC BC CC DC
count 4.000000 4.000000 4.000000 4.00
mean 3.000000 3.000000 5.500000 6.75
std 2.160247 1.825742 2.081666 1.50
min 1.000000 1.000000 3.000000 5.00
25% 1.750000 1.750000 4.500000 5.75
50% 2.500000 3.000000 5.500000 7.00
75% 3.750000 4.250000 6.500000 8.00
max 6.000000 5.000000 8.000000 8.00
In [45]:
data['CC'].value_counts()
Out[45]:
3    1
6    1
5    1
8    1
Name: CC, dtype: int64

Converting and Processing Time Series Data¶

image.png

Timestamp is the most basic and the most commonly used time type; in most cases, time-related strings are converted into Timestamps. pandas provides the to_datetime function to do exactly that.

The DatetimeIndex and PeriodIndex functions
➢ Besides converting values inside the original DataFrame directly to Timestamp, a column can also be extracted on its own and converted to a DatetimeIndex or a PeriodIndex.
➢ When converting to a PeriodIndex, the time interval must be given via the freq parameter; common intervals are Y (year), M (month), D (day), H (hour), T (minute), and S (second). Both functions can convert data and can also create time series data, and their parameters are very similar.
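As noted above, both constructors can also build a time index from scratch rather than convert an existing column; a minimal sketch, with timestamps invented for the example:

```python
import pandas as pd

stamps = ["2016-08-01 11:11:46", "2016-08-01 11:31:55"]

dt_idx = pd.DatetimeIndex(stamps)          # nanosecond-precision timestamps
p_idx = pd.PeriodIndex(stamps, freq="h")   # periods truncated to the hour
print(dt_idx)
print(p_idx)
```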

pandas.to_datetime()¶

In [46]:
data=pd.read_csv('meal_order_info.csv',encoding='gbk')
data['lock_time'].head()
Out[46]:
0    2016/8/1 11:11:46
1    2016/8/1 11:31:55
2    2016/8/1 12:54:37
3    2016/8/1 13:08:20
4    2016/8/1 13:07:16
Name: lock_time, dtype: object
In [47]:
pd.to_datetime(data['lock_time']).head()
Out[47]:
0   2016-08-01 11:11:46
1   2016-08-01 11:31:55
2   2016-08-01 12:54:37
3   2016-08-01 13:08:20
4   2016-08-01 13:07:16
Name: lock_time, dtype: datetime64[ns]
In [48]:
data_index=data
data_index['lock_time'].head()
Out[48]:
0    2016/8/1 11:11:46
1    2016/8/1 11:31:55
2    2016/8/1 12:54:37
3    2016/8/1 13:08:20
4    2016/8/1 13:07:16
Name: lock_time, dtype: object

image.png


In [49]:
pd.DatetimeIndex(data_index['lock_time'])
Out[49]:
DatetimeIndex(['2016-08-01 11:11:46', '2016-08-01 11:31:55',
               '2016-08-01 12:54:37', '2016-08-01 13:08:20',
               '2016-08-01 13:07:16', '2016-08-01 13:23:42',
               '2016-08-01 13:34:18', '2016-08-01 13:50:16',
               '2016-08-01 17:18:20', '2016-08-01 17:44:27',
               ...
               '2016-08-31 18:18:31', '2016-08-31 18:42:42',
               '2016-08-31 18:57:56', '2016-08-31 19:19:31',
               '2016-08-31 20:39:13', '2016-08-31 21:31:48',
               '2016-08-31 21:56:12', '2016-08-31 21:33:34',
               '2016-08-31 21:55:39', '2016-08-31 21:32:56'],
              dtype='datetime64[ns]', name='lock_time', length=945, freq=None)
In [50]:
# freq selects the period granularity, e.g. 'Y' (year), 'M' (month),
# 'D' (day), 'h' (hour), 'T'/'min' (minute), 's' (second)
# freq='s' would keep second precision; freq='h' truncates to the hour
data_period=data
pd.PeriodIndex(data_period['lock_time'],freq='h')
Out[50]:
PeriodIndex(['2016-08-01 11:00', '2016-08-01 11:00', '2016-08-01 12:00',
             '2016-08-01 13:00', '2016-08-01 13:00', '2016-08-01 13:00',
             '2016-08-01 13:00', '2016-08-01 13:00', '2016-08-01 17:00',
             '2016-08-01 17:00',
             ...
             '2016-08-31 18:00', '2016-08-31 18:00', '2016-08-31 18:00',
             '2016-08-31 19:00', '2016-08-31 20:00', '2016-08-31 21:00',
             '2016-08-31 21:00', '2016-08-31 21:00', '2016-08-31 21:00',
             '2016-08-31 21:00'],
            dtype='period[H]', name='lock_time', length=945)
In [51]:
# pd.PeriodIndex?

image.png

In [52]:
data=pd.read_csv('meal_order_info.csv',encoding='gbk')
data=pd.to_datetime(data['lock_time'])
data
Out[52]:
0     2016-08-01 11:11:46
1     2016-08-01 11:31:55
2     2016-08-01 12:54:37
3     2016-08-01 13:08:20
4     2016-08-01 13:07:16
              ...        
940   2016-08-31 21:31:48
941   2016-08-31 21:56:12
942   2016-08-31 21:33:34
943   2016-08-31 21:55:39
944   2016-08-31 21:32:56
Name: lock_time, Length: 945, dtype: datetime64[ns]
In [53]:
data[0].year,data[0].month,data[0].day
Out[53]:
(2016, 8, 1)
In [54]:
data[0].hour,data[0].minute,data[0].second
Out[54]:
(11, 11, 46)
In [55]:
# extract the year of every timestamp via the .dt accessor
data.dt.year.head()
Out[55]:
0    2016.0
1    2016.0
2    2016.0
3    2016.0
4    2016.0
Name: lock_time, dtype: float64
In [56]:
# extract the month of every timestamp via the .dt accessor
data.dt.month.head()
Out[56]:
0    8.0
1    8.0
2    8.0
3    8.0
4    8.0
Name: lock_time, dtype: float64

The Timedelta class
➢ With Timedelta, adding or subtracting a span of time from a timestamp is straightforward.
➢ Two time series can be subtracted directly, yielding a Timedelta.

image.png
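Both points can be sketched on a pair of hand-made series (the timestamps are invented for the example):

```python
import pandas as pd

t1 = pd.to_datetime(pd.Series(["2016-08-01 11:05:36", "2016-08-01 11:07:07"]))
t2 = pd.to_datetime(pd.Series(["2016-08-01 11:11:46", "2016-08-01 11:31:55"]))

delta = t2 - t1                    # subtracting two series yields Timedeltas
print(delta)
print(t1 + pd.Timedelta(days=1))   # shift every timestamp by one day
```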

In [57]:
data=pd.read_csv('meal_order_info.csv',encoding='gbk')
data=pd.to_datetime(data['lock_time'])
data
Out[57]:
0     2016-08-01 11:11:46
1     2016-08-01 11:31:55
2     2016-08-01 12:54:37
3     2016-08-01 13:08:20
4     2016-08-01 13:07:16
              ...        
940   2016-08-31 21:31:48
941   2016-08-31 21:56:12
942   2016-08-31 21:33:34
943   2016-08-31 21:55:39
944   2016-08-31 21:32:56
Name: lock_time, Length: 945, dtype: datetime64[ns]
In [58]:
# Timedelta time arithmetic; supported units:
#  weeks, days, hours, minutes, 
#  seconds, milliseconds, microseconds,
#  nanoseconds
data[1]+pd.Timedelta(weeks=3)
Out[58]:
Timestamp('2016-08-22 11:31:55')

Group-wise Computation with Grouping and Aggregation¶

image.png

Notes on the by parameter
➢ A function is applied to the index values and the results define the groups.
➢ A dict or Series uses its values as the grouping keys.
➢ A NumPy array uses its elements as the grouping keys.
➢ A string or list of strings groups by the named column(s).
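The string form appears in the cells below; a minimal sketch of the less common dict and function forms, on a made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"counts": [1, 2, 3, 4]}, index=["a", "b", "c", "d"])

# dict: maps each index label to a group key
mapping = {"a": "g1", "b": "g1", "c": "g2", "d": "g2"}
print(df.groupby(by=mapping).sum())

# function: called on each index label to produce the group key
print(df.groupby(by=lambda label: label.upper()).sum())
```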

DataFrame.groupby()¶

In [71]:
data=pd.read_excel('meal_order_detail.xlsx')
data.head()
Out[71]:
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016-08-01 11:05:36 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
1 2958 417 609957 NaN NaN 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n 0 1 48 NaN 2016-08-01 11:07:07 NaN NaN NaN 0 NaN NaN caipu/202003.jpg 1442
2 2961 417 609950 NaN NaN 大蒜苋菜 0 1 30 NaN 2016-08-01 11:07:40 NaN NaN NaN 0 NaN NaN caipu/303001.jpg 1442
3 2966 417 610038 NaN NaN 芝麻烤紫菜 0 1 25 NaN 2016-08-01 11:11:11 NaN NaN NaN 0 NaN NaN caipu/105002.jpg 1442
4 2968 417 610003 NaN NaN 蒜香包 0 1 13 NaN 2016-08-01 11:11:30 NaN NaN NaN 0 NaN NaN caipu/503002.jpg 1442
In [77]:
group_data=data[['order_id','counts','amounts']].groupby(by='order_id')
group_data
Out[77]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028C5ECE0370>

image.png

image.png

DataFrame.agg()¶

In [83]:
# pass the aggregation functions as a list
group_data.agg(['min','max']).head()
Out[83]:
counts amounts
min max min max
order_id
137 1 4 1 99
165 1 2 9 178
166 1 2 6 109
171 1 4 10 65
177 1 1 16 48

image.png

In [82]:
# pass a dict mapping each column to its function(s)
group_data.agg({'counts':['min','max'],'amounts':'mean'}).head()
Out[82]:
counts amounts
min max mean
order_id
137 1 4 32.333333
165 1 2 52.944444
166 1 2 48.200000
171 1 4 36.285714
177 1 1 34.250000

image.png

image.png

In [88]:
# apply takes a single function and applies it to every group as a whole
group_data.apply('mean').head()
Out[88]:
counts amounts
order_id
137 1.500000 32.333333
165 1.166667 52.944444
166 1.400000 48.200000
171 1.428571 36.285714
177 1.000000 34.250000
In [89]:
# group_data.apply?

Creating Pivot Tables and Crosstabs¶

pandas.pivot_table()¶

image.png

Creating a pivot table with pivot_table
➢ When no aggregation function aggfunc is specified, mean is used by default, and non-numeric columns are skipped automatically. Pass aggfunc to change the aggregation function.
➢ As with groupby, the grouping key index of pivot_table can consist of several columns.
➢ The columns parameter specifies a column grouping.
➢ When the data has many columns, the values parameter restricts the output to the columns of interest.
➢ Missing combinations are filled with NaN automatically; the fill_value parameter substitutes a chosen value for those gaps.
➢ Set the margins parameter to view summary totals.
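A compact sketch of aggfunc, fill_value, and margins together, on made-up order data (the ids and dish labels are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [137, 137, 165, 165],
    "dishes_name": ["A", "B", "A", "A"],
    "counts": [1, 2, 1, 1],
})

table = pd.pivot_table(df, index="order_id", columns="dishes_name",
                       values="counts",
                       aggfunc="sum",   # override the default mean
                       fill_value=0,    # replace NaN for missing combinations
                       margins=True)    # append an 'All' summary row/column
print(table)
```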

In [95]:
data=pd.read_excel('meal_order_detail.xlsx')
data.head()
Out[95]:
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016-08-01 11:05:36 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
1 2958 417 609957 NaN NaN 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n 0 1 48 NaN 2016-08-01 11:07:07 NaN NaN NaN 0 NaN NaN caipu/202003.jpg 1442
2 2961 417 609950 NaN NaN 大蒜苋菜 0 1 30 NaN 2016-08-01 11:07:40 NaN NaN NaN 0 NaN NaN caipu/303001.jpg 1442
3 2966 417 610038 NaN NaN 芝麻烤紫菜 0 1 25 NaN 2016-08-01 11:11:11 NaN NaN NaN 0 NaN NaN caipu/105002.jpg 1442
4 2968 417 610003 NaN NaN 蒜香包 0 1 13 NaN 2016-08-01 11:11:30 NaN NaN NaN 0 NaN NaN caipu/503002.jpg 1442
In [98]:
# data[['order_id','counts','dishes_name']]: the input data
# index='order_id': row grouping key
# columns='dishes_name': column grouping key
# aggfunc='count': aggregation function
# fill_value=0: default value for missing combinations
pd.pivot_table(data[['order_id','counts','dishes_name']],
              index='order_id',columns='dishes_name',
              aggfunc='count',fill_value=0).head()
Out[98]:
counts
dishes_name 42度海之蓝 北冰洋汽水 38度剑南春 50度古井贡酒 52度泸州老窖 53度茅台 一品香酥藕 三丝鳝鱼 三色凉拌手撕兔 不加一滴油的酸奶蛋糕 ... 香辣腐乳炒虾 香酥两吃大虾 鱼香肉丝拌面 鲜美鳝鱼 鸡蛋、肉末肠粉 麻辣小龙虾 黄尾袋鼠西拉子红葡萄酒 黄油曲奇饼干 黄花菜炒木耳 黑米恋上葡萄
order_id
137 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
165 0 0 1 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
166 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
171 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
177 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 154 columns

pandas.crosstab()¶

➢ A crosstab is a special kind of pivot table, mainly used to compute group frequencies.
➢ Since a crosstab is a kind of pivot table, the parameters are largely the same; the difference is that crosstab's index, columns, and values each take a column extracted from a DataFrame.
➢ pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, dropna=True, normalize=False)
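Because a crosstab counts frequencies, values and aggfunc can be omitted entirely; a sketch on invented columns, also showing the normalize parameter:

```python
import pandas as pd

orders = pd.Series([137, 137, 165, 165, 165], name="order_id")
dishes = pd.Series(["A", "B", "A", "A", "B"], name="dishes_name")

# With no values/aggfunc, crosstab simply counts co-occurrences
print(pd.crosstab(index=orders, columns=dishes))

# normalize='all' converts the counts into fractions of the grand total
print(pd.crosstab(index=orders, columns=dishes, normalize="all"))
```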

image.png

In [100]:
data.head()
Out[100]:
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016-08-01 11:05:36 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
1 2958 417 609957 NaN NaN 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n 0 1 48 NaN 2016-08-01 11:07:07 NaN NaN NaN 0 NaN NaN caipu/202003.jpg 1442
2 2961 417 609950 NaN NaN 大蒜苋菜 0 1 30 NaN 2016-08-01 11:07:40 NaN NaN NaN 0 NaN NaN caipu/303001.jpg 1442
3 2966 417 610038 NaN NaN 芝麻烤紫菜 0 1 25 NaN 2016-08-01 11:11:11 NaN NaN NaN 0 NaN NaN caipu/105002.jpg 1442
4 2968 417 610003 NaN NaN 蒜香包 0 1 13 NaN 2016-08-01 11:11:30 NaN NaN NaN 0 NaN NaN caipu/503002.jpg 1442
In [103]:
# index=data['order_id']: rows
# columns=data['dishes_name']: columns
# values=data['counts']: values
# aggfunc='count': aggregation function
# fillna(0): fill the missing values
pd.crosstab(index=data['order_id'],columns=data['dishes_name'],
           values=data['counts'],aggfunc='count').fillna(0).head()
Out[103]:
dishes_name 42度海之蓝 北冰洋汽水 38度剑南春 50度古井贡酒 52度泸州老窖 53度茅台 一品香酥藕 三丝鳝鱼 三色凉拌手撕兔 不加一滴油的酸奶蛋糕 ... 香辣腐乳炒虾 香酥两吃大虾 鱼香肉丝拌面 鲜美鳝鱼 鸡蛋、肉末肠粉 麻辣小龙虾 黄尾袋鼠西拉子红葡萄酒 黄油曲奇饼干 黄花菜炒木耳 黑米恋上葡萄
order_id
137 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
165 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
166 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
171 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
177 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 154 columns

Data Preprocessing¶

Merging Data¶

Stacking Tables¶

pandas.concat()¶

image.png

image.png

image.png

image.png

In [1]:
import pandas as pd
import numpy as np
In [2]:
# np.nan represents a missing value
df1 = dict({'name':['Jackosn-1','Jason-1','Black-1',
                    np.nan,'Lucy-1',np.nan],
          'score':[90, 60, 50, 100, np.nan, 99]})
df2 = dict({'name':['Jackosn-2','Jason-2','Black-2',
                    np.nan,'Lucy-2',np.nan],
          'score':[90, 60, 50, 100, np.nan, 99]})
data1 = pd.DataFrame(df1)
data2 = pd.DataFrame(df2)
data1,data2
Out[2]:
(        name  score
 0  Jackosn-1   90.0
 1    Jason-1   60.0
 2    Black-1   50.0
 3        NaN  100.0
 4     Lucy-1    NaN
 5        NaN   99.0,
         name  score
 0  Jackosn-2   90.0
 1    Jason-2   60.0
 2    Black-2   50.0
 3        NaN  100.0
 4     Lucy-2    NaN
 5        NaN   99.0)
In [3]:
# axis=1 stacks horizontally; axis=0 stacks vertically
pd.concat([data1,data2],axis=1)
Out[3]:
name score name score
0 Jackosn-1 90.0 Jackosn-2 90.0
1 Jason-1 60.0 Jason-2 60.0
2 Black-1 50.0 Black-2 50.0
3 NaN 100.0 NaN 100.0
4 Lucy-1 NaN Lucy-2 NaN
5 NaN 99.0 NaN 99.0
In [5]:
data1['add']=11
data1
Out[5]:
name score add
0 Jackosn-1 90.0 11
1 Jason-1 60.0 11
2 Black-1 50.0 11
3 NaN 100.0 11
4 Lucy-1 NaN 11
5 NaN 99.0 11
In [8]:
# join='inner' keeps only the columns common to both tables
pd.concat([data1,data2],join='inner')
Out[8]:
name score
0 Jackosn-1 90.0
1 Jason-1 60.0
2 Black-1 50.0
3 NaN 100.0
4 Lucy-1 NaN
5 NaN 99.0
0 Jackosn-2 90.0
1 Jason-2 60.0
2 Black-2 50.0
3 NaN 100.0
4 Lucy-2 NaN
5 NaN 99.0
In [10]:
# join='outer' keeps the union of the columns
pd.concat([data1,data2],join='outer')
Out[10]:
name score add
0 Jackosn-1 90.0 11.0
1 Jason-1 60.0 11.0
2 Black-1 50.0 11.0
3 NaN 100.0 11.0
4 Lucy-1 NaN 11.0
5 NaN 99.0 11.0
0 Jackosn-2 90.0 NaN
1 Jason-2 60.0 NaN
2 Black-2 50.0 NaN
3 NaN 100.0 NaN
4 Lucy-2 NaN NaN
5 NaN 99.0 NaN

DataFrame.append()¶

In [11]:
data1,data2
Out[11]:
(        name  score  add
 0  Jackosn-1   90.0   11
 1    Jason-1   60.0   11
 2    Black-1   50.0   11
 3        NaN  100.0   11
 4     Lucy-1    NaN   11
 5        NaN   99.0   11,
         name  score
 0  Jackosn-2   90.0
 1    Jason-2   60.0
 2    Black-2   50.0
 3        NaN  100.0
 4     Lucy-2    NaN
 5        NaN   99.0)
In [19]:
data=data1.drop('add',axis=1)
data
Out[19]:
name score
0 Jackosn-1 90.0
1 Jason-1 60.0
2 Black-1 50.0
3 NaN 100.0
4 Lucy-1 NaN
5 NaN 99.0
In [20]:
data.append(data2)
C:\Users\86195\AppData\Local\Temp\ipykernel_3676\8019119.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  data.append(data2)
Out[20]:
name score
0 Jackosn-1 90.0
1 Jason-1 60.0
2 Black-1 50.0
3 NaN 100.0
4 Lucy-1 NaN
5 NaN 99.0
0 Jackosn-2 90.0
1 Jason-2 60.0
2 Black-2 50.0
3 NaN 100.0
4 Lucy-2 NaN
5 NaN 99.0

pandas.merge()¶

image.png

image.png

In [27]:
data1=pd.read_csv('order_sample.csv')
data2=pd.read_csv('dishes_info.csv')
data1.head(),data2.head()
Out[27]:
(   detail_id  order_id  dishes_id     place_order_time
 0       2956       417     610062  2016-08-01 11:05:36
 1       2958       417     609957  2016-08-01 11:07:07
 2       2961       417     609950  2016-08-01 11:07:40
 3       2966       417     610038  2016-08-01 11:11:11
 4       2968       417     610003  2016-08-01 11:11:30,
    dishes_id        dishes_name  amounts
 0     610062               蒜蓉生蚝       49
 1     609957  蒙古烤羊腿\r\n\r\n\r\n       48
 2     609950               大蒜苋菜       30
 3     610038              芝麻烤紫菜       25
 4     610003                蒜香包       13)
In [29]:
# left: the left table; right: the right table
# how='left': keep every row of the left table; unmatched right-table values become NaN
# on: used when the key column has the same name in both tables
# when the key names differ, specify them separately with left_on / right_on
pd.merge(left=data1,right=data2,on='dishes_id').head()
Out[29]:
detail_id order_id dishes_id place_order_time dishes_name amounts
0 2956 417 610062 2016-08-01 11:05:36 蒜蓉生蚝 49
1 3574 467 610062 2016-08-01 19:49:48 蒜蓉生蚝 49
2 893 171 610062 2016-08-02 13:55:27 蒜蓉生蚝 49
3 5691 673 610062 2016-08-04 20:38:09 蒜蓉生蚝 49
4 4626 578 610062 2016-08-05 21:35:29 蒜蓉生蚝 49
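The left_on/right_on case mentioned in the comments can be sketched with two invented frames, also showing how='left' preserving unmatched left rows:

```python
import pandas as pd

left = pd.DataFrame({"dishes_id": [610062, 609957, 999999],
                     "order_id": [417, 417, 418]})
right = pd.DataFrame({"dish": [610062, 609957],
                      "amounts": [49, 48]})

# Key names differ, so left_on/right_on are given instead of on;
# how='left' keeps all left rows and fills unmatched right columns with NaN
merged = pd.merge(left=left, right=right, how="left",
                  left_on="dishes_id", right_on="dish")
print(merged)
```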

DataFrame.join()¶

image.png

In [44]:
#  on='dishes_id': left-table column matched against the right table's index
#  lsuffix: suffix appended to overlapping columns from the left table
#  rsuffix: suffix appended to overlapping columns from the right table
data1.join(data2,on='dishes_id',
           lsuffix='_add_l',
           rsuffix='_add_r').head()
Out[44]:
detail_id order_id dishes_id_add_l place_order_time dishes_id_add_r dishes_name amounts
0 2956 417 610062 2016-08-01 11:05:36 NaN NaN NaN
1 2958 417 609957 2016-08-01 11:07:07 NaN NaN NaN
2 2961 417 609950 2016-08-01 11:07:40 NaN NaN NaN
3 2966 417 610038 2016-08-01 11:11:11 NaN NaN NaN
4 2968 417 610003 2016-08-01 11:11:30 NaN NaN NaN
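The all-NaN columns above appear because join matches the left table's on column against the right table's index, and data2 here still carries a default 0..n integer index. A sketch of the usual fix (with invented rows), setting the key as the right table's index first:

```python
import pandas as pd

data1 = pd.DataFrame({"dishes_id": [610062, 609957], "order_id": [417, 417]})
data2 = pd.DataFrame({"dishes_id": [610062, 609957], "amounts": [49, 48]})

# join aligns data1['dishes_id'] with data2's index,
# so the key column must first become the right table's index
joined = data1.join(data2.set_index("dishes_id"), on="dishes_id")
print(joined)
```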

DataFrame.combine_first()¶

image.png

image.png

In [45]:
# np.nan represents a missing value
df1 = dict({'name':['Jackosn-1','Jason-1','Black-1',
                    np.nan,'Lucy-1',np.nan],
          'score':[90, 60, 50, 100, np.nan, 99]})
df2 = dict({'name':['Jackosn-2','Jason-2','Black-2',
                    np.nan,'Lucy-2',np.nan],
          'score':[90, 60, 50, 100, np.nan, 99]})
data1 = pd.DataFrame(df1)
data2 = pd.DataFrame(df2)
data1,data2
Out[45]:
(        name  score
 0  Jackosn-1   90.0
 1    Jason-1   60.0
 2    Black-1   50.0
 3        NaN  100.0
 4     Lucy-1    NaN
 5        NaN   99.0,
         name  score
 0  Jackosn-2   90.0
 1    Jason-2   60.0
 2    Black-2   50.0
 3        NaN  100.0
 4     Lucy-2    NaN
 5        NaN   99.0)
In [46]:
data1.combine_first(data2)
Out[46]:
name score
0 Jackosn-1 90.0
1 Jason-1 60.0
2 Black-1 50.0
3 NaN 100.0
4 Lucy-1 NaN
5 NaN 99.0

Data Cleaning¶

Deduplicating with DataFrame.drop_duplicates()¶

image.png

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
In [3]:
data=pd.read_csv('detail_duplicates.csv')
data.head()
Out[3]:
order_id dishes_name counts amounts
0 NaN 蒜蓉生蚝 1.0 49.0
1 NaN NaN 1.0 48.0
2 417.0 大蒜苋菜 1.0 30.0
3 417.0 芝麻烤紫菜 1.0 25.0
4 417.0 蒜香包 1.0 13.0
In [5]:
# subset: column(s) used to detect duplicates; several can be passed
# keep: when duplicates occur, whether to keep the first or the last one
# inplace: whether to modify the source data; defaults to False
data.drop_duplicates(subset=['order_id'],
                    keep='last',
                    inplace=False)
Out[5]:
order_id dishes_name counts amounts
5 NaN NaN NaN NaN
9 301.0 番茄有机花菜 1.0 32.0
10 417.0 蒙古烤羊腿\r\n\r\n\r\n 1.0 48.0
In [6]:
# pass several columns to subset
data.drop_duplicates(subset=['order_id','dishes_name'])
Out[6]:
order_id dishes_name counts amounts
0 NaN 蒜蓉生蚝 1.0 49.0
1 NaN NaN 1.0 48.0
2 417.0 大蒜苋菜 1.0 30.0
3 417.0 芝麻烤紫菜 1.0 25.0
4 417.0 蒜香包 1.0 13.0
6 301.0 香烤牛排\r\n 1.0 55.0
8 301.0 芝麻烤紫菜 1.0 25.0
9 301.0 番茄有机花菜 1.0 32.0
10 417.0 蒙古烤羊腿\r\n\r\n\r\n 1.0 48.0

Detecting Missing Values with DataFrame.isnull and DataFrame.notnull¶

In [8]:
# is each value missing?
data.isnull().head()
Out[8]:
order_id dishes_name counts amounts
0 True False False False
1 True True False False
2 False False False False
3 False False False False
4 False False False False
In [10]:
# is each value present?
data.notnull().head()
Out[10]:
order_id dishes_name counts amounts
0 False True True True
1 False False True True
2 True True True True
3 True True True True
4 True True True True

Dropping Missing Values with DataFrame.dropna()¶

image.png

In [12]:
data
Out[12]:
order_id dishes_name counts amounts
0 NaN 蒜蓉生蚝 1.0 49.0
1 NaN NaN 1.0 48.0
2 417.0 大蒜苋菜 1.0 30.0
3 417.0 芝麻烤紫菜 1.0 25.0
4 417.0 蒜香包 1.0 13.0
5 NaN NaN NaN NaN
6 301.0 香烤牛排\r\n 1.0 55.0
7 417.0 芝麻烤紫菜 1.0 25.0
8 301.0 芝麻烤紫菜 1.0 25.0
9 301.0 番茄有机花菜 1.0 32.0
10 417.0 蒙古烤羊腿\r\n\r\n\r\n 1.0 48.0
In [15]:
# axis: the axis to drop along
# how: the drop criterion, 'any' or 'all'
# subset: column(s) checked for missing values; several can be passed
data.dropna(axis=0,how='all',
           subset=['dishes_name'])
Out[15]:
order_id dishes_name counts amounts
0 NaN 蒜蓉生蚝 1.0 49.0
2 417.0 大蒜苋菜 1.0 30.0
3 417.0 芝麻烤紫菜 1.0 25.0
4 417.0 蒜香包 1.0 13.0
6 301.0 香烤牛排\r\n 1.0 55.0
7 417.0 芝麻烤紫菜 1.0 25.0
8 301.0 芝麻烤紫菜 1.0 25.0
9 301.0 番茄有机花菜 1.0 32.0
10 417.0 蒙古烤羊腿\r\n\r\n\r\n 1.0 48.0

Filling Missing Values with DataFrame.fillna()¶

image.png

image.png

In [16]:
data.head()
Out[16]:
order_id dishes_name counts amounts
0 NaN 蒜蓉生蚝 1.0 49.0
1 NaN NaN 1.0 48.0
2 417.0 大蒜苋菜 1.0 30.0
3 417.0 芝麻烤紫菜 1.0 25.0
4 417.0 蒜香包 1.0 13.0
In [17]:
# fill missing values with a given constant
data.fillna(13)
Out[17]:
order_id dishes_name counts amounts
0 13.0 蒜蓉生蚝 1.0 49.0
1 13.0 13 1.0 48.0
2 417.0 大蒜苋菜 1.0 30.0
3 417.0 芝麻烤紫菜 1.0 25.0
4 417.0 蒜香包 1.0 13.0
5 13.0 13 13.0 13.0
6 301.0 香烤牛排\r\n 1.0 55.0
7 417.0 芝麻烤紫菜 1.0 25.0
8 301.0 芝麻烤紫菜 1.0 25.0
9 301.0 番茄有机花菜 1.0 32.0
10 417.0 蒙古烤羊腿\r\n\r\n\r\n 1.0 48.0
In [18]:
# fill missing values from the table itself
# forward fill, top to bottom
data.fillna(method='ffill')
Out[18]:
order_id dishes_name counts amounts
0 NaN 蒜蓉生蚝 1.0 49.0
1 NaN 蒜蓉生蚝 1.0 48.0
2 417.0 大蒜苋菜 1.0 30.0
3 417.0 芝麻烤紫菜 1.0 25.0
4 417.0 蒜香包 1.0 13.0
5 417.0 蒜香包 1.0 13.0
6 301.0 香烤牛排\r\n 1.0 55.0
7 417.0 芝麻烤紫菜 1.0 25.0
8 301.0 芝麻烤紫菜 1.0 25.0
9 301.0 番茄有机花菜 1.0 32.0
10 417.0 蒙古烤羊腿\r\n\r\n\r\n 1.0 48.0
In [22]:
# fill missing values from the table itself
# backward fill, bottom to top
data.fillna(method='bfill')
Out[22]:
order_id dishes_name counts amounts
0 417.0 蒜蓉生蚝 1.0 49.0
1 417.0 大蒜苋菜 1.0 48.0
2 417.0 大蒜苋菜 1.0 30.0
3 417.0 芝麻烤紫菜 1.0 25.0
4 417.0 蒜香包 1.0 13.0
5 301.0 香烤牛排\r\n 1.0 55.0
6 301.0 香烤牛排\r\n 1.0 55.0
7 417.0 芝麻烤紫菜 1.0 25.0
8 301.0 芝麻烤紫菜 1.0 25.0
9 301.0 番茄有机花菜 1.0 32.0
10 417.0 蒙古烤羊腿\r\n\r\n\r\n 1.0 48.0

Outlier Detection¶

image.png

image.png

In [24]:
data = pd.read_excel('meal_order_detail.xlsx')
data
Out[24]:
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016-08-01 11:05:36 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
1 2958 417 609957 NaN NaN 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n 0 1 48 NaN 2016-08-01 11:07:07 NaN NaN NaN 0 NaN NaN caipu/202003.jpg 1442
2 2961 417 609950 NaN NaN 大蒜苋菜 0 1 30 NaN 2016-08-01 11:07:40 NaN NaN NaN 0 NaN NaN caipu/303001.jpg 1442
3 2966 417 610038 NaN NaN 芝麻烤紫菜 0 1 25 NaN 2016-08-01 11:11:11 NaN NaN NaN 0 NaN NaN caipu/105002.jpg 1442
4 2968 417 610003 NaN NaN 蒜香包 0 1 13 NaN 2016-08-01 11:11:30 NaN NaN NaN 0 NaN NaN caipu/503002.jpg 1442
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2774 6750 774 610011 NaN NaN 白饭/大碗 0 1 10 NaN 2016-08-10 21:56:24 NaN NaN NaN 0 NaN NaN caipu/601005.jpg 1138
2775 6742 774 609996 NaN NaN 牛尾汤 0 1 40 NaN 2016-08-10 21:56:48 NaN NaN NaN 0 NaN NaN caipu/201006.jpg 1138
2776 6756 774 609949 NaN NaN 意文柠檬汁 0 1 13 NaN 2016-08-10 22:01:52 NaN NaN NaN 0 NaN NaN caipu/404005.jpg 1138
2777 6763 774 610014 NaN NaN 金玉良缘 0 1 30 NaN 2016-08-10 22:03:58 NaN NaN NaN 0 NaN NaN caipu/302003.jpg 1138
2778 6764 774 610017 NaN NaN 酸辣藕丁 0 1 33 NaN 2016-08-10 22:04:30 NaN NaN NaN 0 NaN NaN caipu/302006.jpg 1138

2779 rows × 19 columns

In [26]:
data['amounts'].isnull().sum()
Out[26]:
0
In [27]:
plt.boxplot(data['amounts'])
plt.show()
# plt.boxplot?
In [28]:
# Custom outlier handler: the IQR rule marks values outside
# [QL - 1.5*IQR, QU + 1.5*IQR] as NaN
def deal_with(x):
    QU = x.quantile(0.75)   # upper quartile
    QL = x.quantile(0.25)   # lower quartile
    IQR = QU - QL
    x[(x > (QU + 1.5*IQR)) | 
      (x < (QL - 1.5*IQR))] = np.nan
    return x
In [29]:
data['amounts'].isnull().sum()
Out[29]:
0
In [31]:
print(deal_with(data['amounts']).isnull().sum())
173

Data Standardization¶

Min-Max (Deviation) Normalization¶

image.png

image.png

In [33]:
data = pd.read_excel('meal_order_detail.xlsx')
data.head()
Out[33]:
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016-08-01 11:05:36 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
1 2958 417 609957 NaN NaN 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n 0 1 48 NaN 2016-08-01 11:07:07 NaN NaN NaN 0 NaN NaN caipu/202003.jpg 1442
2 2961 417 609950 NaN NaN 大蒜苋菜 0 1 30 NaN 2016-08-01 11:07:40 NaN NaN NaN 0 NaN NaN caipu/303001.jpg 1442
3 2966 417 610038 NaN NaN 芝麻烤紫菜 0 1 25 NaN 2016-08-01 11:11:11 NaN NaN NaN 0 NaN NaN caipu/105002.jpg 1442
4 2968 417 610003 NaN NaN 蒜香包 0 1 13 NaN 2016-08-01 11:11:30 NaN NaN NaN 0 NaN NaN caipu/503002.jpg 1442
In [34]:
# custom function implementing min-max (deviation) normalization
def deviation_standar(x):
    return (x-x.min())/(x.max()-x.min())
deviation_standar(data[['counts','amounts']])
Out[34]:
counts amounts
0 0.0 0.271186
1 0.0 0.265537
2 0.0 0.163842
3 0.0 0.135593
4 0.0 0.067797
... ... ...
2774 0.0 0.050847
2775 0.0 0.220339
2776 0.0 0.067797
2777 0.0 0.163842
2778 0.0 0.180791

2779 rows × 2 columns

Z-Score (Standard Deviation) Standardization¶

image.png

In [36]:
data.head()
Out[36]:
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016-08-01 11:05:36 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
1 2958 417 609957 NaN NaN 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n 0 1 48 NaN 2016-08-01 11:07:07 NaN NaN NaN 0 NaN NaN caipu/202003.jpg 1442
2 2961 417 609950 NaN NaN 大蒜苋菜 0 1 30 NaN 2016-08-01 11:07:40 NaN NaN NaN 0 NaN NaN caipu/303001.jpg 1442
3 2966 417 610038 NaN NaN 芝麻烤紫菜 0 1 25 NaN 2016-08-01 11:11:11 NaN NaN NaN 0 NaN NaN caipu/105002.jpg 1442
4 2968 417 610003 NaN NaN 蒜香包 0 1 13 NaN 2016-08-01 11:11:30 NaN NaN NaN 0 NaN NaN caipu/503002.jpg 1442
In [37]:
# custom function implementing z-score standardization
def standarzation_standar(x):
    return (x-x.mean())/x.std()
standarzation_standar(data[['counts','amounts']])
Out[37]:
counts amounts
0 -0.177784 0.099510
1 -0.177784 0.072343
2 -0.177784 -0.416674
3 -0.177784 -0.552512
4 -0.177784 -0.878523
... ... ...
2774 -0.177784 -0.960026
2775 -0.177784 -0.144998
2776 -0.177784 -0.878523
2777 -0.177784 -0.416674
2778 -0.177784 -0.335171

2779 rows × 2 columns

Decimal Scaling Normalization¶

image.png

In [38]:
data.head()
Out[38]:
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016-08-01 11:05:36 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
1 2958 417 609957 NaN NaN 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n 0 1 48 NaN 2016-08-01 11:07:07 NaN NaN NaN 0 NaN NaN caipu/202003.jpg 1442
2 2961 417 609950 NaN NaN 大蒜苋菜 0 1 30 NaN 2016-08-01 11:07:40 NaN NaN NaN 0 NaN NaN caipu/303001.jpg 1442
3 2966 417 610038 NaN NaN 芝麻烤紫菜 0 1 25 NaN 2016-08-01 11:11:11 NaN NaN NaN 0 NaN NaN caipu/105002.jpg 1442
4 2968 417 610003 NaN NaN 蒜香包 0 1 13 NaN 2016-08-01 11:11:30 NaN NaN NaN 0 NaN NaN caipu/503002.jpg 1442
In [39]:
import numpy as np

# Custom function implementing decimal scaling standardization:
# divide by 10**k, where k = ceil(log10(max(|x|))),
# so every scaled value has absolute value below 1
def decimal_standar(x):
    k = np.ceil(np.log10(x.abs().max()))
    return x/10**k
decimal_standar(data[['counts','amounts']])
Out[39]:
counts amounts
0 0.1 0.049
1 0.1 0.048
2 0.1 0.030
3 0.1 0.025
4 0.1 0.013
... ... ...
2774 0.1 0.010
2775 0.1 0.040
2776 0.1 0.013
2777 0.1 0.030
2778 0.1 0.033

2779 rows × 2 columns
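Decimal scaling maps values into (-1, 1) by dividing by 10**k, where k = ceil(log10(max|x|)). A minimal standalone sketch with hypothetical values (the function name `decimal_scaling` is chosen here for illustration):

```python
import numpy as np
import pandas as pd

# Decimal scaling: divide by 10**k so that max(|x|) / 10**k < 1.
def decimal_scaling(x):
    k = np.ceil(np.log10(x.abs().max()))
    return x / 10 ** k

# max(|amounts|) = 178, so k = ceil(log10(178)) = 3 and we divide by 1000
demo = pd.DataFrame({'amounts': [13, 49, 178]})
print(decimal_scaling(demo)['amounts'].tolist())  # [0.013, 0.049, 0.178]
```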

Transforming data¶

Dummy variable encoding¶

pandas.get_dummies()¶

image.png

image.png

image.png

In [41]:
data.head()
Out[41]:
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016-08-01 11:05:36 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
1 2958 417 609957 NaN NaN 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n 0 1 48 NaN 2016-08-01 11:07:07 NaN NaN NaN 0 NaN NaN caipu/202003.jpg 1442
2 2961 417 609950 NaN NaN 大蒜苋菜 0 1 30 NaN 2016-08-01 11:07:40 NaN NaN NaN 0 NaN NaN caipu/303001.jpg 1442
3 2966 417 610038 NaN NaN 芝麻烤紫菜 0 1 25 NaN 2016-08-01 11:11:11 NaN NaN NaN 0 NaN NaN caipu/105002.jpg 1442
4 2968 417 610003 NaN NaN 蒜香包 0 1 13 NaN 2016-08-01 11:11:30 NaN NaN NaN 0 NaN NaN caipu/503002.jpg 1442
In [43]:
# Dummy variable (one-hot) encoding
pd.get_dummies(data[['amounts','dishes_name']])
Out[43]:
amounts dishes_name_ 42度海之蓝 dishes_name_ 北冰洋汽水 dishes_name_38度剑南春 dishes_name_50度古井贡酒 dishes_name_52度泸州老窖 dishes_name_53度茅台 dishes_name_一品香酥藕 dishes_name_三丝鳝鱼 dishes_name_三色凉拌手撕兔 ... dishes_name_香辣腐乳炒虾 dishes_name_香酥两吃大虾 dishes_name_鱼香肉丝拌面 dishes_name_鲜美鳝鱼 dishes_name_鸡蛋、肉末肠粉 dishes_name_麻辣小龙虾 dishes_name_黄尾袋鼠西拉子红葡萄酒 dishes_name_黄油曲奇饼干 dishes_name_黄花菜炒木耳 dishes_name_黑米恋上葡萄
0 49 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 48 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 30 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 25 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 13 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2774 10 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2775 40 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2776 13 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2777 30 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2778 33 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

2779 rows × 155 columns
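The effect of `pd.get_dummies` is easier to see on a tiny frame with hypothetical values: each distinct category becomes its own 0/1 indicator column (named `column_category`), while numeric columns pass through unchanged:

```python
import pandas as pd

# Each category in dishes_name becomes an indicator column;
# the numeric amounts column is left untouched.
demo = pd.DataFrame({'amounts': [49, 48, 30],
                     'dishes_name': ['蒜蓉生蚝', '大蒜苋菜', '蒜蓉生蚝']})
dummies = pd.get_dummies(demo)
print(list(dummies.columns))
# ['amounts', 'dishes_name_大蒜苋菜', 'dishes_name_蒜蓉生蚝']
```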

Discretizing continuous data¶

pandas.cut()¶

image.png

image.png

In [45]:
data.head()
Out[45]:
detail_id order_id dishes_id logicprn_name parent_class_name dishes_name itemis_add counts amounts cost place_order_time discount_amt discount_reason kick_back add_inprice add_info bar_code picture_file emp_id
0 2956 417 610062 NaN NaN 蒜蓉生蚝 0 1 49 NaN 2016-08-01 11:05:36 NaN NaN NaN 0 NaN NaN caipu/104001.jpg 1442
1 2958 417 609957 NaN NaN 蒙古烤羊腿_x000D_\n_x000D_\n_x000D_\n 0 1 48 NaN 2016-08-01 11:07:07 NaN NaN NaN 0 NaN NaN caipu/202003.jpg 1442
2 2961 417 609950 NaN NaN 大蒜苋菜 0 1 30 NaN 2016-08-01 11:07:40 NaN NaN NaN 0 NaN NaN caipu/303001.jpg 1442
3 2966 417 610038 NaN NaN 芝麻烤紫菜 0 1 25 NaN 2016-08-01 11:11:11 NaN NaN NaN 0 NaN NaN caipu/105002.jpg 1442
4 2968 417 610003 NaN NaN 蒜香包 0 1 13 NaN 2016-08-01 11:11:30 NaN NaN NaN 0 NaN NaN caipu/503002.jpg 1442
In [48]:
# data['amounts']   the continuous column to discretize
# 5                 the number of equal-width bins
pd.cut(data['amounts'],5)
Out[48]:
0        (36.4, 71.8]
1        (36.4, 71.8]
2       (0.823, 36.4]
3       (0.823, 36.4]
4       (0.823, 36.4]
            ...      
2774    (0.823, 36.4]
2775     (36.4, 71.8]
2776    (0.823, 36.4]
2777    (0.823, 36.4]
2778    (0.823, 36.4]
Name: amounts, Length: 2779, dtype: category
Categories (5, interval[float64, right]): [(0.823, 36.4] < (36.4, 71.8] < (71.8, 107.2] < (107.2, 142.6] < (142.6, 178.0]]
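`pd.cut` also accepts a `labels` argument to name the bins, which makes downstream grouping more readable than raw intervals. A minimal sketch with hypothetical values:

```python
import pandas as pd

# Equal-width binning into 3 named bins; counting per bin (in
# category order low/mid/high) shows how the values distribute.
s = pd.Series([13, 25, 30, 48, 49, 170])
binned = pd.cut(s, 3, labels=['low', 'mid', 'high'])
print(binned.value_counts().sort_index().tolist())  # [5, 0, 1]
```

Note that equal-width bins can be badly unbalanced when the data is skewed, as here; `pd.qcut` is the equal-frequency alternative.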



Author:Jason Black